fix: extract Docling async markdown result by he-yufeng · Pull Request #3031 · HKUDS/LightRAG

he-yufeng · 2026-05-07T15:48:20Z

Summary

align the Docling async parser defaults with docling-serve v1: port 5001 examples, files upload field, task_status polling, and /v1/status/poll/{task_id}
add a result URL template fallback for services that return only task status while exposing results at /v1/result/{task_id}
extract document.md_content from the Docling result JSON instead of returning the whole response envelope as raw document text
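The URL templates and the content-field extraction described above can be sketched as follows. This is a minimal illustration, not the code in lightrag/pipeline.py: `build_url`, the `get_by_path` body shown here, the host name, and the sample envelope are all assumptions made for the example. Only the port, the two paths, and the `document.md_content` field come from the summary.

```python
import json
from typing import Any, Optional

BASE_URL = "http://localhost:5001"  # port from the summary; host is an assumption
POLL_TEMPLATE = "/v1/status/poll/{task_id}"
RESULT_TEMPLATE = "/v1/result/{task_id}"


def build_url(base_url: str, template: str, task_id: str) -> str:
    """Fill a task-id URL template (hypothetical helper for this sketch)."""
    return base_url.rstrip("/") + template.format(task_id=task_id)


def get_by_path(payload: Any, path: str) -> Optional[Any]:
    """Walk a dotted path through nested dicts; return None if a key is missing."""
    current = payload
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical result envelope; the real response may carry more fields.
envelope = json.loads('{"document": {"md_content": "# Title"}, "status": "success"}')

result_url = build_url(BASE_URL, RESULT_TEMPLATE, "abc123")
markdown = get_by_path(envelope, "document.md_content")

print(result_url)  # http://localhost:5001/v1/result/abc123
print(markdown)    # # Title
```

Returning only the value at the content path, rather than the whole envelope, is what keeps status and metadata fields out of the indexed document text.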

Verified locally

python -m pytest tests\test_pipeline_release_closure.py -q -k "docling or mineru_empty_service_result"
python -m ruff check lightrag\pipeline.py tests\test_pipeline_release_closure.py
python -m ruff format --check lightrag\pipeline.py
python -m py_compile lightrag\pipeline.py tests\test_pipeline_release_closure.py
git diff --check

Note: running the full tests/test_pipeline_release_closure.py file on Windows still fails two pre-existing path-separator assertions that are unrelated to this change.

Closes #2996

he-yufeng · 2026-05-08T15:52:09Z

Rebased this onto the latest dev after the pipeline refactor and resolved the lightrag/pipeline.py conflict. The Docling async defaults are now applied in the current parse_docling protocol config rather than the old pre-refactor location.

Validation on the rebased head 606ed098:

python -m uv run pytest tests\test_pipeline_release_closure.py::test_parse_docling_uses_docling_serve_async_defaults tests\test_pipeline_release_closure.py::test_protocol_parse_service_extracts_docling_result_markdown tests\test_pipeline_release_closure.py::test_parse_docling_empty_service_result_raises_without_fallback tests\test_pipeline_release_closure.py::test_parse_mineru_empty_service_result_raises_without_fallback -q
python -m uv run ruff check lightrag\pipeline.py tests\test_pipeline_release_closure.py
python -m py_compile lightrag\pipeline.py tests\test_pipeline_release_closure.py
git diff --check upstream/dev..HEAD

danielaskdd · 2026-05-08T17:26:27Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 606ed098cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

danielaskdd · 2026-05-09T07:26:31Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a69155b8e

danielaskdd · 2026-05-09T12:44:11Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27a31e4400

chatgpt-codex-connector · 2026-05-09T12:46:45Z

+            except Exception:
+                return text
+            content_val = get_by_path(payload, content_field)
+            return _string_content(content_val) or text


Reject empty Docling content instead of indexing envelope

When Docling returns a JSON result whose configured content field is present but empty (for example a blank document, a skipped conversion, or a conversion result with document.md_content: ""), this fallback returns the entire JSON envelope as document text. That bypasses the existing "Docling parser returned empty content" guard in parse_docling and can index status, errors, or metadata instead of extracted markdown. Distinguish "JSON parsed but content is empty/missing" from "non-JSON raw text" here rather than falling back to text.
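One way to draw the distinction requested in this comment is sketched below. It is an illustration only, under the assumption that raw non-JSON text should pass through unchanged while a parsed envelope with empty or missing content should raise; `extract_markdown` and `EmptyContentError` are hypothetical names, not identifiers from lightrag/pipeline.py.

```python
import json


class EmptyContentError(RuntimeError):
    """Raised when JSON parsed but the content field is empty or missing."""


def extract_markdown(text: str, content_field: str) -> str:
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all: treat the body as raw document text.
        return text
    if not isinstance(payload, dict):
        # Valid JSON but not an envelope (e.g. a bare number): keep the raw text.
        return text

    # Walk the dotted path, e.g. "document.md_content".
    value = payload
    for key in content_field.split("."):
        value = value.get(key) if isinstance(value, dict) else None

    if isinstance(value, str) and value.strip():
        return value
    # Envelope parsed, but there is nothing usable at the content path:
    # fail loudly instead of returning the envelope as document text.
    raise EmptyContentError(f"no content at {content_field!r}")


print(extract_markdown("plain markdown", "document.md_content"))
# plain markdown
print(extract_markdown('{"document": {"md_content": "# A"}}', "document.md_content"))
# # A
```

With this shape, a result such as `{"document": {"md_content": ""}}` raises instead of being indexed, so an upstream empty-content guard still has a chance to fire.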

chatgpt-codex-connector · 2026-05-09T12:46:45Z

+# DOCLING_CONTENT_FIELD=document.md_content
+# DOCLING_FILE_FIELD=files
 # DOCLING_SUCCESS_VALUES=done,success,completed
 # DOCLING_FAILED_VALUES=failed,error,cancelled


Include Docling's failure status in the sample override

If a user uncomments this Docling block, DOCLING_FAILED_VALUES overrides the code default that now includes failure. Docling v1 reports failed async tasks as task_status: "failure", so with this sample configuration a failed conversion is treated as still pending until DOCLING_MAX_POLLS expires instead of raising the parse-service failure promptly; add failure to the example list.
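The effect described in this comment can be reproduced with a small status classifier. This is a sketch under assumptions: `parse_values` and `classify` are hypothetical helpers, and the comma-separated format simply mirrors the sample lines above. It does not reproduce the pipeline's actual polling code.

```python
def parse_values(raw: str) -> set:
    """Split a comma-separated override such as DOCLING_FAILED_VALUES."""
    return {item.strip().lower() for item in raw.split(",") if item.strip()}


def classify(status: str, success_values: set, failed_values: set) -> str:
    """Map a reported task status to success, failed, or pending."""
    normalized = status.strip().lower()
    if normalized in success_values:
        return "success"
    if normalized in failed_values:
        return "failed"
    # Anything unrecognized is polled again until the poll limit is reached.
    return "pending"


success = parse_values("done,success,completed")
sample_failed = parse_values("failed,error,cancelled")          # sample as written
fixed_failed = parse_values("failed,failure,error,cancelled")   # with "failure" added

print(classify("failure", success, sample_failed))  # pending
print(classify("failure", success, fixed_failed))   # failed
```

The first call shows why the sample override matters: a status the failure list does not recognize is indistinguishable from a task that is still running.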

danielaskdd · 2026-05-09T13:01:03Z

The FileProcessingConfiguration-zh.md has been restructured. Please merge the latest dev branch and update the Docling usage instructions within the new document architecture.

he-yufeng · 2026-05-09T16:58:58Z

Rebased onto the latest dev branch and updated the Chinese file-processing doc in the new structure. The Docling quick-start endpoint now matches the current 5001 examples, and the obsolete duplicated full_docs section from the old layout is gone.

Validation:

python -m py_compile lightrag\pipeline.py tests\test_pipeline_release_closure.py
python -m uv run ruff check lightrag\pipeline.py tests\test_pipeline_release_closure.py
python -m uv run pytest tests\test_pipeline_release_closure.py::test_parse_docling_uses_docling_serve_async_defaults tests\test_pipeline_release_closure.py::test_protocol_parse_service_extracts_docling_result_markdown tests\test_pipeline_release_closure.py::test_protocol_parse_service_raises_on_docling_failure_status tests\test_pipeline_release_closure.py::test_parse_docling_empty_service_result_raises_without_fallback tests\test_pipeline_release_closure.py::test_parse_mineru_empty_service_result_raises_without_fallback -q (5 passed)

he-yufeng force-pushed the fix/docling-markdown-extraction branch from 3c02f33 to 606ed09 on May 8, 2026 15:51

danielaskdd added the "tracked" label (Issue is tracked by project) on May 8, 2026

chatgpt-codex-connector Bot reviewed May 8, 2026

Comment thread lightrag/pipeline.py

he-yufeng force-pushed the fix/docling-markdown-extraction branch 2 times, most recently from 436945c to 6a69155 on May 8, 2026 22:20

chatgpt-codex-connector Bot reviewed May 9, 2026

Comment thread docs/FileProcessingConfiguration-zh.md

chatgpt-codex-connector Bot reviewed May 9, 2026

he-yufeng added 2 commits May 10, 2026 00:55

fix: extract Docling async markdown result

b5ebb95

docs: align Docling endpoint examples

b057844

he-yufeng force-pushed the fix/docling-markdown-extraction branch from 27a31e4 to b057844 on May 9, 2026 16:58

he-yufeng wants to merge 2 commits into HKUDS:dev from he-yufeng:fix/docling-markdown-extraction